The Dark Web represents a hotbed for illicit activity, where users communicate on different market forums in order to exchange goods and services. Law enforcement agencies benefit from forensic tools that perform authorship analysis, in order to identify and profile users based on their textual content. However, authorship analysis has traditionally been studied using literary texts, such as fragments from novels or fan fiction, which may not be suitable in a cybercrime context. Moreover, the few works that employ authorship analysis tools for cybercrime usually rely on ad-hoc experimental setups and datasets. To address these issues, we release VeriDark: a benchmark comprising three large-scale authorship verification datasets and an authorship identification dataset, obtained from user activity on Dark Web related Reddit communities or popular illicit Dark Web market forums. We evaluate competitive NLP baselines on the three datasets and analyze their predictions to better understand the limitations of such approaches. We make the datasets and baselines publicly available at https://github.com/bit-ml/veridark
An important preliminary step of optical character recognition systems is the detection of text lines. To address this task in the context of historical data with missing labels, we propose a self-paced learning algorithm capable of improving line detection performance. We conjecture that pages with more ground-truth bounding boxes are less likely to have missing annotations. Based on this hypothesis, we sort the training examples in descending order of their number of ground-truth bounding boxes and organize them into k batches. Using our self-paced learning method, we train a line detector over k iterations, gradually adding the batches with fewer ground-truth annotations. At each iteration, we combine the ground-truth bounding boxes with pseudo-boxes (bounding boxes predicted by the model itself) using non-maximum suppression, and include the resulting annotations in the next training iteration. We show that our self-paced learning strategy brings significant performance gains on two datasets of historical documents, improving the average precision of YOLOv4 by more than 12% on one dataset and by more than 39% on the other.
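The loop below is a minimal sketch of this self-paced strategy, assuming a generic `detector` with `fit` and `predict` methods (hypothetical names standing in for the YOLOv4 training and inference code). Ground-truth boxes are given a score of 1.0 so that non-maximum suppression never discards them in favor of a pseudo-box.

```python
# Minimal sketch of the self-paced training loop described above.
# The `detector` interface and page structure are illustrative assumptions.
import torch
from torchvision.ops import nms

def self_paced_train(detector, pages, k, iou_thresh=0.5):
    # Conjecture: pages with more ground-truth boxes are less likely to
    # have missing annotations, so add those to the training set first.
    pages = sorted(pages, key=lambda p: len(p["boxes"]), reverse=True)
    step = (len(pages) + k - 1) // k
    batches = [pages[i:i + step] for i in range(0, len(pages), step)]

    seen = []
    for batch in batches:
        seen.extend(batch)
        detector.fit(seen)                       # train on the pages added so far
        for page in seen:                        # augment labels with pseudo-boxes
            pred_boxes, pred_scores = detector.predict(page["image"])
            boxes = torch.cat([page["boxes"], pred_boxes])
            # ground-truth boxes get score 1.0 so NMS always keeps them
            scores = torch.cat([torch.ones(len(page["boxes"])), pred_scores])
            keep = nms(boxes, scores, iou_thresh)
            page["boxes"] = boxes[keep]          # annotations for next iteration
    return detector
```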
The task of identifying the author of a text spans several decades and has been approached using linguistics, statistics and, more recently, machine learning. Inspired by the impressive performance gains across a broad range of natural language processing tasks, and by the availability of the large-scale PAN authorship datasets, we first study the effectiveness of several BERT-like transformers for the task of authorship verification. These models prove to consistently achieve very high scores. Next, we empirically show that they focus on local cues rather than on author writing-style characteristics, exploiting existing biases in the dataset. To address this problem, we provide new splits of PAN-2020 in which training and test data are sampled from disjoint topics or authors. Finally, we introduce DarkReddit, a dataset with a different input data distribution. We further use it to analyze the domain generalization performance of the models in a low-data regime, and how that performance changes when the proposed PAN-2020 splits are used for fine-tuning. We show that those splits can improve the models' ability to transfer knowledge to a new, significantly different dataset.
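For reference, authorship verification with a BERT-like transformer is typically cast as pair classification over the two texts. The sketch below uses the generic HuggingFace interface; the checkpoint name and label convention are illustrative assumptions rather than the paper's code, and the model would still need fine-tuning on the PAN-2020 splits before its scores are meaningful.

```python
# Minimal sketch: authorship verification as pair classification with a
# BERT-like cross-encoder. Checkpoint and label convention are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=2)  # label 1 = same author (assumed)

def same_author_prob(text_a: str, text_b: str) -> float:
    # Both texts are packed into one sequence: [CLS] text_a [SEP] text_b [SEP]
    inputs = tokenizer(text_a, text_b, truncation=True,
                       max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()
```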
In this paper, we propose a three-stage analysis of the evolution of COVID-19 in Romania. There are two main issues when it comes to forecasting a pandemic. The first is the fact that the reported numbers of infected and recovered cases are unreliable, while the number of deaths is more accurate. The second is that many factors influence the evolution of a pandemic. We therefore proceed in three stages. The first stage is based on the classical SIR model, which we fit using a neural network; this provides a first set of daily parameters. In the second stage, we propose a modification of the SIR model in which the deceased form a separate category. Using the first estimates and a grid search, we obtain daily estimates of the parameters. The third stage defines the notion of turning points (local extrema) of the parameters; we call the period between two such points a regime. We outline a generic way of making forecasts based on the time-varying parameters of the resulting SIRD model.
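For concreteness, the second-stage model with the deceased split out is the standard SIRD system, written here with time-varying rates (a textbook form; the paper's exact parameterization may refine it):

```latex
% Standard SIRD dynamics with time-varying rates; N = S + I + R + D is the
% total population, beta(t) infection, gamma(t) recovery, mu(t) mortality.
\begin{aligned}
\frac{dS}{dt} &= -\beta(t)\,\frac{S(t)\,I(t)}{N}, &
\frac{dI}{dt} &= \beta(t)\,\frac{S(t)\,I(t)}{N} - \bigl(\gamma(t) + \mu(t)\bigr)\,I(t), \\
\frac{dR}{dt} &= \gamma(t)\,I(t), &
\frac{dD}{dt} &= \mu(t)\,I(t).
\end{aligned}
```

The third-stage turning points are then the local extrema of the fitted curves $\beta(t)$, $\gamma(t)$, $\mu(t)$, and a regime is the interval between two consecutive turning points.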
Although Reinforcement Learning (RL) has shown impressive results in games and simulation, real-world application of RL suffers from its instability under changing environment conditions and hyperparameters. We give a first impression of the extent of this instability by showing that the hyperparameters found by automatic hyperparameter optimization (HPO) methods are not only dependent on the problem at hand, but even on how well the state describes the environment dynamics. Specifically, we show that agents in contextual RL require different hyperparameters if they are shown how environmental factors change. In addition, finding adequate hyperparameter configurations is not equally easy for both settings, further highlighting the need for research into how hyperparameters influence learning and generalization in RL.
Active learning as a paradigm in deep learning is especially important in applications involving intricate perception tasks such as object detection, where labels are difficult and expensive to acquire. Developing active learning methods in such fields is highly computationally expensive and time-consuming, which obstructs the progression of research and leads to a lack of comparability between methods. In this work, we propose and investigate a sandbox setup for rapid development and transparent evaluation of active learning in deep object detection. Our experiments with commonly used configurations of datasets and detection architectures found in the literature show that results obtained in our sandbox environment are representative of results on standard configurations. The total compute time needed to obtain results and assess the learning behavior can thereby be reduced by factors of up to 14 compared with Pascal VOC and up to 32 compared with BDD100k. This allows data acquisition and labeling strategies to be tested and evaluated in under half a day, and contributes to the transparency and development speed in the field of active learning for object detection.
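The kind of data-acquisition strategy such a sandbox is meant to evaluate follows the usual pool-based loop; the sketch below is a generic illustration, where names like `acquire` and `oracle` are our stand-ins rather than the paper's API:

```python
# Generic pool-based active-learning loop; `acquire` scores how informative
# an unlabeled sample is (e.g., predictive uncertainty of the detector).
def active_learning_loop(model, labeled, pool, oracle, acquire, rounds, budget):
    for _ in range(rounds):
        model.fit(labeled)                                   # retrain on current labels
        ranked = sorted(pool, key=lambda x: acquire(model, x), reverse=True)
        picked = ranked[:budget]                             # most informative samples
        labeled += [(x, oracle.label(x)) for x in picked]    # query annotations
        pool = [x for x in pool if x not in picked]          # shrink unlabeled pool
    return model
```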
Earthquakes, fire, and floods often cause structural collapses of buildings. However, the inspection of damaged buildings poses a high risk to emergency forces, or may even be impossible. We present three recent selected missions of the Robotics Task Force of the German Rescue Robotics Center, in which both ground and aerial robots were used to explore destroyed buildings. We describe and reflect on the missions, as well as the lessons learned that resulted from them. To make robots from research laboratories fit for real operations, realistic test environments were set up for outdoor and indoor use and tested in regular exercises by researchers and emergency forces. Based on this experience, the robots and their control software were significantly improved. Furthermore, teams of researchers and first responders were formed, each with realistic assessments of the operational and practical suitability of robotic systems.
The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice, as well as the bottlenecks faced by the community, in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical image analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, and algorithm characteristics. A median of 72% of challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants stated that they did not have enough time for method development (32%), and 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based; of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once, which was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants, and only 50% performed ensembling, based on either multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
We study inductive matrix completion (matrix completion with side information) under an i.i.d. subgaussian noise assumption in a low-noise regime, with uniform sampling of the entries. We obtain, for the first time, generalization bounds with the following three properties: (1) they scale like the standard deviation of the noise and, in particular, approach zero in the exact recovery case; (2) even in the presence of noise, they converge to zero as the sample size approaches infinity; and (3) for a fixed dimension of the side information, they have only a logarithmic dependence on the size of the matrix. Unlike many works on approximate recovery, we present results both for bounded Lipschitz losses and for the absolute loss, with the latter relying on Talagrand-type inequalities. The proofs create a bridge between two approaches to the theoretical analysis of matrix completion, since they combine techniques from both the exact recovery literature and the approximate recovery literature.
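One common way to formalize this setting, assumed here only for illustration since the abstract does not spell out the model, is to observe uniformly sampled noisy entries of a matrix whose structure is explained by the side information:

```latex
% Illustrative observation model (our assumption, not quoted from the paper):
% X, Y carry the side information, M is the low-rank core, sigma the noise level.
Z_{ij} = (X M Y^\top)_{ij} + \sigma\,\xi_{ij},
\qquad (i,j) \sim \mathrm{Unif}([m] \times [n]),
\quad \xi_{ij} \ \text{i.i.d. subgaussian}.
```

Under such a model, property (1) says the bounds scale with $\sigma$ and thus vanish in the exact recovery limit $\sigma \to 0$.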
Metric learning aims to learn distances from the data, which enhances the performance of similarity-based algorithms. Author style detection is a metric learning problem, where learning style features with small intra-class variation and large inter-class differences is of great importance for achieving good performance. Recently, metric learning based on softmax loss has been used successfully for style detection. While softmax loss can produce separable representations, its discriminative power is relatively poor. In this work, we propose NBC-Softmax, a contrastive-loss-based clustering technique for softmax loss, which is more intuitive and achieves superior performance. Our technique meets the criterion of a large number of samples, thus achieving block contrastiveness, which is proven to outperform pairwise losses. It uses mini-batch sampling effectively and is scalable. Experiments on 4 darkweb social forums with NBCSAuthor, which applies the proposed NBC-Softmax to author and sybil detection, show that our negative block contrastive approach consistently outperforms state-of-the-art methods using the same network architecture. Our code is publicly available at https://github.com/gayanku/NBC-Softmax
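As a schematic of the underlying idea (not a reproduction of NBC-Softmax itself, whose exact formulation is in the linked repository), a softmax-style objective becomes block-contrastive when each embedding is normalized against the whole block of in-batch negatives rather than a single paired negative:

```python
# Schematic block-contrastive softmax objective over a mini-batch.
# This illustrates the principle only; it is not the NBC-Softmax formulation.
import torch
import torch.nn.functional as F

def block_contrastive_softmax(embeddings, labels, temperature=0.1):
    z = F.normalize(embeddings, dim=1)           # unit-norm style embeddings
    sim = z @ z.t() / temperature                # pairwise cosine similarities
    n = z.size(0)
    mask_self = torch.eye(n, dtype=torch.bool, device=z.device)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    pos = same & ~mask_self                      # positives: same class, not self
    # log-softmax over all non-self pairs; the negatives form the contrast block
    logprob = F.log_softmax(sim.masked_fill(mask_self, float("-inf")), dim=1)
    loss = -(logprob * pos.float()).sum(1) / pos.float().sum(1).clamp(min=1)
    return loss.mean()
```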